PCI: hv: Reserve hv_pci swiotlb from buddy and publish via sysfs#260

Draft
benhillis wants to merge 1 commit into linux-msft-wsl-6.18.y from user/benhillis/hv-pci-swiotlb-fix
Member

@benhillis benhillis commented May 26, 2026

Summary

Reserve the dedicated hv_pci swiotlb pool from the buddy allocator at core_initcall time and publish the resulting (base, size) under /sys/bus/vmbus/drivers/hv_pci/swiotlb_{base,size} so userspace can forward the real GPA to the host-side device backend. This replaces the old "host dictates a GPA" flow.

Why

WSL container test runs intermittently saw the guest die with WorkerExitType=StoppedOnReset WorkerExitDetail=TripleFault WorkerExitInitiator=GuestOS between io scheduler mq-deadline registered and the next initcall. Root cause: memblock_reserve() accepts ranges that are not actually backed by EPT, and swiotlb_create_pool() -> swiotlb_init_io_tlb_pool() then memsets 64 MiB of unbacked pages.

What changed

  • hv_pci_swiotlb=<size> is now the only accepted form. early_hv_pci_swiotlb() parses with memparse(p, &end) and rejects any unconsumed trailing characters with pr_warn, so the legacy <base>,<size> form (which memparse(p, NULL) would otherwise silently treat as just the leading hex base) is no longer accepted.
  • core_initcall(hv_pci_swiotlb_alloc_pool) asks the buddy allocator for a contiguous DMA32 range via alloc_contig_pages(nr_pages, GFP_KERNEL | __GFP_DMA32 | __GFP_ZERO, first_online_node, &node_online_map). __GFP_ZERO faults every page in via the page allocator, so by the time swiotlb_create_pool() runs the memory is known-good. Kernel ownership keeps Hyper-V page reporting from yanking the backing. Running at core_initcall (initcall level 1) gives us the earliest possible shot at a fresh, mostly-empty DMA32 zone.
  • Allocator path gated on CONFIG_CONTIG_ALLOC with a no-op stub fallback.
  • The hv_pci_swiotlb= early_param and the alloc core_initcall are wrapped in #ifndef MODULE, because both early_param and core_initcall are unavailable from modules (otherwise core_initcall redefines module_init's init_module). Module builds compile cleanly and fall back to the default swiotlb pool. The WSL config uses CONFIG_PCI_HYPERV=y, so the feature is active there.
  • (base, size) published via DRIVER_ATTR_RO once vmbus_driver_register() succeeds. An hv_pci_swiotlb_published flag makes publish()/unpublish() symmetric and idempotent, so a partial sysfs-create failure cleans up only what it created and module exit can't double-unpublish.
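
The size-only parsing rule in the first bullet can be modeled in userspace. This is a minimal sketch, not the driver's code: parse_size() is a stand-in for the kernel's memparse() built on strtoull(), and parse_size_only() is a hypothetical helper name. It shows why checking the end pointer rejects the legacy <base>,<size> form instead of silently reading only the leading base.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Userspace stand-in for memparse(): parse a number with an optional
 * K/M/G suffix and report where parsing stopped via *end.
 */
uint64_t parse_size(const char *s, const char **end)
{
	char *p;
	uint64_t v;

	assert(s != NULL && end != NULL);
	v = strtoull(s, &p, 0);	/* base 0: decimal, 0x hex, or 0 octal */
	switch (*p) {
	case 'G': case 'g':
		v <<= 10;
		/* fall through */
	case 'M': case 'm':
		v <<= 10;
		/* fall through */
	case 'K': case 'k':
		v <<= 10;
		p++;
		break;
	default:
		break;
	}
	*end = p;
	return v;
}

/*
 * Hypothetical helper: accept only "<size>". Anything left unconsumed,
 * such as ",64M" in the legacy "<base>,<size>" form, is a rejection.
 */
bool parse_size_only(const char *arg, uint64_t *size)
{
	const char *end;
	uint64_t v = parse_size(arg, &end);

	if (end == arg || *end != '\0')
		return false;
	*size = v;
	return true;
}
```

With this model, "64M" parses to 67108864, while "0x8000000,64M" is rejected because parsing stops at the comma.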

Validation

  • scripts/checkpatch.pl --strict -g HEAD -> 0 errors, 0 warnings, 0 checks, 203 lines checked.
  • make W=1 drivers/pci/controller/pci-hyperv.o is clean with both CONFIG_PCI_HYPERV=y and CONFIG_PCI_HYPERV=m (arm64 cross-build).
  • Boot-tested under WSL2 with swiotlb=force hv_pci_swiotlb=64M:
    • dmesg: hv_pci: reserved swiotlb pool [0x0000000008000000..0x000000000c000000)
    • /sys/bus/vmbus/drivers/hv_pci/swiotlb_base -> 0x8000000
    • /sys/bus/vmbus/drivers/hv_pci/swiotlb_size -> 67108864
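
A userspace agent could consume the two sysfs files roughly as follows. This is a sketch under stated assumptions: the function names are hypothetical, and the file formats (hex with a 0x prefix for the base, decimal for the size) are taken only from the sample values above. The paths are passed in as parameters so the logic can be exercised against ordinary files.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Read one unsigned integer (decimal or 0x-prefixed hex) from a file. */
int read_u64_file(const char *path, uint64_t *out)
{
	char buf[64];
	char *end;
	FILE *f;

	assert(path != NULL && out != NULL);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	*out = strtoull(buf, &end, 0);
	if (end == buf || (*end != '\n' && *end != '\0'))
		return -1;	/* empty or trailing garbage */
	return 0;
}

/*
 * Hypothetical helper: read base and size, then compute the exclusive
 * end of the pool, matching the [base..end) form printed in dmesg.
 */
int read_pool_range(const char *base_path, const char *size_path,
		    uint64_t *base, uint64_t *end)
{
	uint64_t size;

	if (read_u64_file(base_path, base) || read_u64_file(size_path, &size))
		return -1;
	if (size == 0 || *base > UINT64_MAX - size)
		return -1;	/* reject empty or overflowing ranges */
	*end = *base + size;
	return 0;
}
```

For the values shown above, a base of 0x8000000 and a size of 67108864 give an exclusive end of 0xc000000, which agrees with the dmesg line.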

Notes

swiotlb has no destroy_pool() counterpart to swiotlb_create_pool(), so the backing pages are deliberately leaked on driver unload. hv_pci is rarely hot-replaced and the pool is bounded (default 64 MiB).

@benhillis benhillis force-pushed the user/benhillis/hv-pci-swiotlb-fix branch from 9077664 to 9bc4922 Compare May 26, 2026 23:19
@benhillis benhillis requested a review from Copilot May 27, 2026 00:41
Copilot AI left a comment

Pull request overview

Reworks how the hv_pci driver provisions its dedicated swiotlb pool. Instead of accepting a host-supplied GPA (which could be reclaimed by Hyper-V page-reporting and triple-fault the guest), the driver now reserves a contiguous DMA32 range from the buddy allocator at core_initcall time and exposes the resulting base/size via /sys/bus/vmbus/drivers/hv_pci/swiotlb_{base,size} for a userspace agent to forward to the host.

Changes:

  • Replaces <base>,<size> cmdline parsing with size-only hv_pci_swiotlb=<size>, 2 MiB aligned.
  • Adds a core_initcall (hv_pci_swiotlb_alloc_pool) that uses alloc_contig_pages(GFP_KERNEL|__GFP_DMA32|__GFP_ZERO, …) to back the pool, gated by CONFIG_CONTIG_ALLOC with a no-op fallback.
  • Adds DRIVER_ATTR_RO(swiotlb_base/size) sysfs files published after vmbus_driver_register() and removed on exit; backing pages are intentionally leaked on driver unload.

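
The size arithmetic behind "2 MiB aligned" can be illustrated with a short sketch. The helper names are hypothetical, the 4 KiB page size is an assumption (arm64 kernels may use larger pages), and the summary does not say whether the driver rounds an unaligned size up or rejects it, so both a check and a round-up are shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define POOL_ALIGN        (2ULL << 20)	/* 2 MiB, per the summary above */
#define ASSUMED_PAGE_SIZE 4096ULL	/* assumption: 4 KiB base pages */

/* True if size is a non-zero multiple of 2 MiB. */
bool pool_size_is_aligned(uint64_t size)
{
	return size != 0 && (size & (POOL_ALIGN - 1)) == 0;
}

/* Round size up to the next 2 MiB boundary. */
uint64_t pool_size_align_up(uint64_t size)
{
	return (size + POOL_ALIGN - 1) & ~(POOL_ALIGN - 1);
}

/* Convert a page-multiple size into a page count for the allocator. */
uint64_t pool_nr_pages(uint64_t size)
{
	assert(size % ASSUMED_PAGE_SIZE == 0);
	return size / ASSUMED_PAGE_SIZE;
}
```

Under these assumptions, a 64 MiB pool is already aligned and corresponds to 16384 pages of 4 KiB.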
@benhillis benhillis force-pushed the user/benhillis/hv-pci-swiotlb-fix branch from 9bc4922 to ac7c4d3 Compare May 27, 2026 15:40
@benhillis benhillis requested a review from Copilot May 27, 2026 15:59
Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

@benhillis benhillis force-pushed the user/benhillis/hv-pci-swiotlb-fix branch from ac7c4d3 to d7b840a Compare May 27, 2026 16:42
@benhillis benhillis requested a review from Copilot May 27, 2026 16:45
@benhillis benhillis force-pushed the user/benhillis/hv-pci-swiotlb-fix branch from d7b840a to 498d391 Compare May 27, 2026 16:48
Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.

Comment thread drivers/pci/controller/pci-hyperv.c
/* UMA on WSL; first_online_node biases nothing in practice. */
pages = alloc_contig_pages(nr_pages,
			   GFP_KERNEL | __GFP_DMA32 | __GFP_ZERO,
Contributor

Any particular reason why we use __GFP_DMA32?

Member Author

Same reason mainline swiotlb uses it: the bounce buffer must be reachable by 32-bit-DMA PCI devices, and ZONE_DMA32 is the safe universal home.

Member

We don't have any 32-bit-DMA PCI devices.

Member

It's probably OK to do this for now, though. We can relax this later if testing shows it's not necessary.

The old early_param parsed hv_pci_swiotlb=<base>,<size> and reserved the
host-supplied physical address with memblock_reserve(), which does not
validate that the range is backed by EPT.  Under Hyper-V page-reporting
the backing for a nominally usable e820 range can be absent, so the
memset() inside swiotlb_init_io_tlb_pool() triple-faulted the guest.

Pick the base in the guest instead:

  * A core_initcall calls alloc_contig_pages(__GFP_DMA32 | __GFP_ZERO)
    for a kernel-owned, contiguous, below-4G range.  __GFP_ZERO faults
    the pages in, and kernel ownership keeps page reporting away.  Gated
    on CONFIG_CONTIG_ALLOC; without it the dedicated pool is skipped.

  * The hv_pci_swiotlb= early_param and the alloc core_initcall are only
    compiled in for built-in builds (#ifndef MODULE), because both
    early_param and core_initcall are unavailable from modules.  Module
    builds compile cleanly and fall back to the default swiotlb pool.

  * (base, size) is exposed via DRIVER_ATTR_RO under
    /sys/bus/vmbus/drivers/hv_pci/swiotlb_{base,size} so userspace can
    forward the real GPA to the host-side device backend.

swiotlb has no destroy_pool() counterpart, so the pages are leaked on
driver unload; hv_pci is rarely hot-replaced and the pool is bounded.

Signed-off-by: Ben Hillis <benhillis@microsoft.com>
@benhillis benhillis force-pushed the user/benhillis/hv-pci-swiotlb-fix branch from 498d391 to 3b14370 Compare May 27, 2026 17:32
@benhillis benhillis requested a review from Copilot May 27, 2026 17:33
Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.
